Here I explore and analyze a data set of 1,599 red wines. It has 11 variables describing the chemical properties of each wine (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, and “alcohol”) plus a “quality” rating. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). The goal is to draw some conclusions about the relation between the chemical properties and the quality of the wines.
I load some libraries
I load the DataFile
Next comes a preliminary exploration of the dataset. I need to understand the structure of the individual variables, so I first check the dimensions, the variable names, and the min, max, mean, and median of each variable to get an initial sense of the data.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
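The dimensions, structure, and summary shown above are the kind of output that `dim()`, `str()`, and `summary()` print. As a minimal sketch, here they are on a toy data frame. The name `reds_toy` is hypothetical, and its values are just the first five rows of three columns as printed in the `str()` output, not the full data.

```r
# Toy data frame: first five rows of three columns, copied from the str() output.
reds_toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

dim(reds_toy)      # number of rows and columns
str(reds_toy)      # type and first values of each variable
summary(reds_toy)  # min, quartiles, median, mean, and max per variable
```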
Initial Observations
1- pH is between 2.74 and 4.01. The median is 3.31.
2- The maximum quality value is 8, the minimum is 3, and the median is 6.
3- Alcohol is between 8.4% and 14.9%.
4- The residual sugar range is wide, from 0.9 to 15.5 g/dm^3, but the median is only 2.2 g/dm^3.
Let’s make some plots:
Residual sugar, chlorides and sulphates have similar distributions.
pH and density have a very similar distribution. Correlation can be important here.
One of the main interests in this study is to understand the chemical properties that produce a great quality wine.
I decided to break quality into 3 ranges, (0,5], (5,6], and (6,10], treating 7 and 8 together as high quality (since only 18 wines have quality 8), and then check some chemical properties like pH and alcohol.
## (0,5] (5,6] (6,10]
## 744 638 217
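The bucket labels above have the form that R's `cut()` produces with right-closed intervals, so the grouping was presumably done that way. A minimal sketch, assuming `cut()` with breaks at 0, 5, 6, and 10, on made-up quality scores (not the real data):

```r
# Toy quality scores (made-up values, not the real data).
quality <- c(3L, 5L, 5L, 6L, 6L, 7L, 8L)

# Right-closed intervals: (0,5], (5,6], (6,10]
quality.bucket <- cut(quality, breaks = c(0, 5, 6, 10))

# Count how many wines fall into each bucket.
table(quality.bucket)
```

Because the intervals are closed on the right, a quality of 6 falls in (5,6] and a quality of 7 falls in (6,10].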
Only 13.6% of the wines have a quality value of 7 or more. Wines of all quality levels have similar pH, so it is not an important factor. The same holds for sulphates and density. However, few good wines ((6,10] range) have less than 10% alcohol.
In the following plots I want to understand the spread of the data for some variables using the boxplot function.
There are some points that could be considered outliers at >=14% alcohol or below 9% alcohol. We should discuss whether to include these data points when computing statistics. Most of the wines are between 9% and 11%.
Most of the wines have between ~1.7 and 3.4 g/dm^3 of residual sugar. In this case there appear to be a huge number of outliers, with large deviations. We should probably exclude wines above 8.
Most of the wines have a pH between 3 and 3.6. In this case the spread is not as large as for residual sugar. The outliers lie toward the ends of the observed range (2.74 to 4.01).
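One way to make "outlier" concrete is the 1.5×IQR rule that R's `boxplot()` applies by default to decide where the whiskers end. The sketch below applies that rule to the quartiles reported in the summary output. It is only an approximation of what the plots show: `boxplot()` works from hinges rather than these quartiles, and the author may have judged outliers by eye.

```r
# Quartiles copied from the summary() output earlier in this report.
q1 <- c(alcohol = 9.5, residual.sugar = 1.9, pH = 3.21)
q3 <- c(alcohol = 11.1, residual.sugar = 2.6, pH = 3.40)

# Interquartile range and the 1.5 x IQR fences for each variable.
iqr <- q3 - q1
fences <- rbind(lower = q1 - 1.5 * iqr,
                upper = q3 + 1.5 * iqr)
round(fences, 3)
```

Under this rule, points outside the fences would be drawn individually. For alcohol the lower fence (7.1%) is below the observed minimum of 8.4%, so only high-alcohol wines would be flagged.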
The dataset consists of 1,599 observations of 13 variables (the 11 chemical properties, quality, and the row index X).
The quality of the wine
The relation between the properties of the wine with the quality value.
Yes: quality.bucket, in order to group the quality of the wines better.
Unusual distributions: citric.acid and free SO2. Normal distributions: density, pH, and volatile.acidity. Negative: sugar, total SO2, alcohol and sulphates.
Yes. I created the reds_short subset to explore different bivariate plots and correlations with fewer variables (next section).
Let’s use ggpairs to look at different bivariate plots and get an initial sense of the strongest correlations between variables in this dataset.
Notes about correlations:
1: There is a good correlation (0.668) between fixed acidity and density.
2: There is a good negative correlation (-0.683) between fixed acidity and pH.
3: There is a good correlation (0.668) between Total SO2 and free SO2.
4: There is a good correlation (0.672) between fixed acidity and citric.acid.
Let’s look at some of these relations with their regression lines.
##
## Pearson's product-moment correlation
##
## data: reds$fixed.acidity and reds$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
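The header `data: reds$fixed.acidity and reds$density` suggests this output comes from `cor.test()` on those two columns. As a worked check that needs no data, the t statistic of a Pearson test can be recomputed from the reported correlation and degrees of freedom using t = r·sqrt(df) / sqrt(1 − r²):

```r
r  <- 0.6680473   # sample correlation reported above
df <- 1597        # degrees of freedom, n - 2 with n = 1599

# t statistic of the Pearson correlation test.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 3)
```

The result agrees with the reported t = 35.877 up to rounding, which is why a correlation of this size is so clearly different from zero with 1,599 observations.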
Density vs. fixed acidity gives R=0.668. There is still considerable dispersion, with some data points that could be removed (for example around density=0.99, acidity=8), but given the large amount of data those points would not change the value of R much.
##
## Pearson's product-moment correlation
##
## data: reds$fixed.acidity and reds$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
There is a good negative correlation (-0.683) between fixed acidity and pH.
##
## Pearson's product-moment correlation
##
## data: reds$free.sulfur.dioxide and reds$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
There is a good correlation (0.668) between Total SO2 and free SO2.
Let’s now focus on the chemical properties of a good wine in comparison with wines of other quality levels.
Alcohol increases with the quality of the wine
pH remains between 3.3 and 3.8 for all quality wines.
Chlorides show less dispersion as we move to higher quality wines, staying close to 0.1.
It was interesting that so few high quality wines had less than 10% alcohol, so I looked at alcohol by quality level:
## reds$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## reds$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## reds$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## reds$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## reds$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## reds$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
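The per-quality summaries above have the layout that `by()` prints when it applies `summary()` to one column within each level of another. A minimal sketch on made-up values (the `reds_toy` data frame is hypothetical, not the real data):

```r
# Toy data (made-up values, not the real data set).
reds_toy <- data.frame(
  quality = c(5L, 5L, 5L, 7L, 7L, 7L),
  alcohol = c(9.4, 9.8, 10.2, 10.8, 11.5, 12.1)
)

# One summary() of alcohol per quality level, printed group by group.
alcohol_by_quality <- by(reds_toy$alcohol, reds_toy$quality, summary)
alcohol_by_quality
```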
While few wines in the low quality range have more than 12% alcohol, very few in the high quality range have less than 10%.
I used a multi-scatter plot to look for relationships between all the chemicals. I also looked individually at how each chemical relates to the quality of the wine. I found that we can only get some hints about how these chemicals produce a good quality wine, and that quality depends more on their combination than on any single one. However, I did find that alcohol content increases with the quality of the wine.
1: There is a good correlation (0.668) between fixed acidity and density.
2: There is a good negative correlation (-0.683) between fixed acidity and pH.
3: There is a good correlation (0.668) between Total SO2 and free SO2.
4: There is a good correlation (0.672) between fixed acidity and citric.acid.
5: Quality and level of alcohol
The fixed acidity with the pH
There is a good (negative) correlation between pH and fixed acidity: low pH corresponds to high fixed acidity. When density increases, fixed acidity increases too, but pH does not.
There is a good correlation between alcohol and density. It looks stronger for higher quality wines, although there is still a lot of dispersion. Wines with more alcohol usually have a higher pH.
Consider only wines with the following properties: alcohol>10%, sulphates>0.70, pH<3.3, chlorides<0.075, volatile.acidity<0.4, total.sulfur.dioxide<40. The first table shows the bucket counts for all wines, the second for the wines that meet these conditions:
## (0,5] (5,6] (6,10]
## 744 638 217
## (0,5] (5,6] (6,10]
## 2 11 27
We go from 13.6% of wines with quality 7 or 8 to 67.5%.
Based on this I create a model to predict the quality of a wine from the amounts of the different chemicals it contains.
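The two percentages follow directly from the bucket counts. A short worked check, using only the numbers printed in the two tables (the variable names are hypothetical):

```r
# Bucket counts copied from the two tables above.
all_wines <- c("(0,5]" = 744, "(5,6]" = 638, "(6,10]" = 217)
filtered  <- c("(0,5]" = 2,   "(5,6]" = 11,  "(6,10]" = 27)

# Share of high quality wines before and after applying the conditions.
share_before <- all_wines[["(6,10]"]] / sum(all_wines)
share_after  <- filtered[["(6,10]"]] / sum(filtered)

round(100 * c(before = share_before, after = share_after), 1)
```

Note that the filtered group contains only 40 wines, so the 67.5% rests on a small sample.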
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = reds)
## m2: lm(formula = quality ~ alcohol + pH, data = reds)
## m3: lm(formula = quality ~ alcohol + pH + sulphates, data = reds)
## m4: lm(formula = quality ~ alcohol + pH + sulphates + volatile.acidity,
## data = reds)
## m5: lm(formula = quality ~ alcohol + pH + sulphates + volatile.acidity +
## total.sulfur.dioxide, data = reds)
##
## ==============================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 4.426*** 3.345*** 3.493*** 3.749***
## (0.175) (0.387) (0.401) (0.385) (0.387)
## alcohol 0.361*** 0.386*** 0.367*** 0.321*** 0.308***
## (0.017) (0.017) (0.017) (0.016) (0.017)
## pH -0.850*** -0.635*** -0.306** -0.319**
## (0.116) (0.116) (0.115) (0.115)
## sulphates 0.868*** 0.635*** 0.667***
## (0.104) (0.102) (0.102)
## volatile.acidity -1.156*** -1.130***
## (0.100) (0.100)
## total.sulfur.dioxide -0.002***
## (0.001)
## ----------------------------------------------------------------------------------------------
## R-squared 0.227 0.252 0.283 0.339 0.347
## adj. R-squared 0.226 0.251 0.282 0.337 0.345
## sigma 0.710 0.699 0.684 0.657 0.654
## F 468.267 268.888 210.183 204.210 169.271
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1694.466 -1660.297 -1595.858 -1585.956
## Deviance 805.870 779.508 746.896 689.059 680.578
## AIC 3448.114 3396.931 3330.594 3203.717 3185.913
## BIC 3464.245 3418.440 3357.480 3235.980 3223.553
## N 1599 1599 1599 1599 1599
## ==============================================================================================
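The table compares nested linear models, each adding one predictor to the previous one. The sketch below shows the same idea on synthetic data. Everything in it is an assumption for illustration: the data are simulated, the generating coefficients are arbitrary, and the comparison uses base R functions instead of whatever produced the table above.

```r
set.seed(1)  # synthetic data only; not the wine data

n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
pH      <- rnorm(n, mean = 3.3, sd = 0.15)
# Hypothetical generating process, chosen just for illustration.
quality <- 2 + 0.35 * alcohol - 0.5 * pH + rnorm(n, sd = 0.7)
toy <- data.frame(quality, alcohol, pH)

m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + pH)   # m1 plus one more predictor

# R-squared can only stay equal or rise as predictors are added;
# adjusted R-squared and AIC penalise the extra term.
data.frame(
  model  = c("m1", "m2"),
  r2     = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  aic    = c(AIC(m1), AIC(m2))
)
```

This is why the R-squared row in the table rises from m1 to m5: that alone does not show each added variable is useful, so the adjusted R-squared, AIC, and BIC rows are the better guide.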
-Low fixed acidity goes with higher pH, as we supposed.
-A high quality wine should have low pH, chlorides, volatile.acidity, and density, and high citric acid, alcohol content, and sulphates.
Alcohol is directly related to the quality of the wine. Fixed acidity is related to pH.
I created a model to predict the quality of a wine from the chemical variables, but R^2=0.347 is too low. There is too much dispersion to reach a conclusion based only on the chemicals and on the differing tastes of as few as 3 people.
## reds$quality.bucket: (0,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.400 9.700 9.926 10.300 14.900
## --------------------------------------------------------
## reds$quality.bucket: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## reds$quality.bucket: (6,10]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
## reds$quality.bucket: (0,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.200 3.310 3.312 3.400 3.900
## --------------------------------------------------------
## reds$quality.bucket: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## reds$quality.bucket: (6,10]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.270 3.289 3.380 3.780
Alcohol content is important when considering a high quality wine, as we can see in the previous figure. Practically no wines with a quality of 5 or lower have more than 12% alcohol; in that group, wines with less than 10% are the majority. On the other hand, wines with more than 12% alcohol are abundant among wines of quality 7 or 8, while almost none of those have less than 10%. pH=3.3 (black solid line) stays more or less in the middle of all the distributions. We can see a slight shift towards lower pH as we move to high quality wines.
This plot shows how the median alcohol content changes as a function of quality. In general, a high quality wine should have a high level of alcohol. However, the dispersion is large.
We go from 13.6% of wines with quality 7 or 8 to 67.5%.
Wines with low fixed acidity have high pH (blue dots), as we could have expected. Besides, from the previous figure we can also see that density increases as fixed acidity increases. The correlation between fixed acidity and pH (-0.683) is the strongest I found in the dataset.
I found several correlations between different chemicals in wine, and I could understand, in general, which of these chemicals relate directly to the quality of the wine. My main struggle was that, since I do not know much about these chemicals (more literature research would have helped), I started the analysis quite blind. Only alcohol, pH and quality made sense to me. I also wanted to find a model to predict the quality of the wine from the amounts of the chemicals, but the fit of that model turned out to be quite poor (R~0.6, from R^2=0.347).
-I think more people rating each wine would have given better results.
-Other variables, like the age of the wine, could also be important.